Achieving Load Balancing of HDFS Clusters Using Markov Model
Author
Abstract
The combination of Hadoop and HDFS is becoming a de facto standard for handling big data. HDFS is a distributed file system designed for big data. In HDFS, a file consists of multiple large blocks, and a central manager tries to scatter these blocks across different nodes to maximize I/O throughput. Hadoop is a framework that supports data-intensive parallel applications and runs on top of HDFS. In Hadoop, a user-level job is usually split into small tasks, and each task is assigned to a cluster node where its input data is stored. If the nodes holding the necessary data are fully occupied by other jobs, the task is assigned to other nodes that might not hold the input data. In that case, those nodes must fetch the input data from other nodes at the cost of network traffic. Hadoop job scheduling is designed to reduce this traffic: when scheduling, Hadoop prefers nodes that hold the necessary input data. Ill-balanced placement of hot data therefore increases network traffic and degrades overall system performance, so well-balanced placement of hot data is critical to Hadoop performance. To balance hot data, system behavior is monitored and static rules are enforced by human operators at run time. Migration under these static rules tends to be conservative and proceeds slowly in order to minimize network traffic overhead. These characteristics make static migration unsuitable when new cluster nodes are added, which requires rapid migration of a usually large volume of data. In this study, we use a Markov model to achieve rapid migration of large amounts of data to new nodes.
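The locality preference described in the abstract can be illustrated with a minimal Python sketch. It is not Hadoop's actual scheduler and not the method of this study: the function name `assign_tasks`, the slot model, and the node and block names are all hypothetical, chosen only to show how skewed placement of hot blocks leads to remote reads.

```python
def assign_tasks(tasks, block_locations, free_slots):
    """Greedy locality-first assignment (illustrative assumption only).

    tasks           -- list of (task_id, block_id) pairs
    block_locations -- dict mapping block_id to the set of nodes holding it
    free_slots      -- dict mapping node name to its number of free task slots
    Returns (assignment, remote_reads).
    """
    slots = dict(free_slots)  # work on a copy; the caller's dict is untouched
    assignment = {}
    remote_reads = 0
    for task_id, block_id in tasks:
        holders = block_locations.get(block_id, set())
        # Prefer a node that already stores the input block and has a free slot.
        local = sorted(n for n in holders if slots.get(n, 0) > 0)
        if local:
            node = local[0]
        else:
            # Otherwise fall back to the node with the most free slots;
            # it must fetch the block over the network.
            candidates = sorted(n for n, s in slots.items() if s > 0)
            if not candidates:
                break  # no capacity left anywhere in the cluster
            node = max(candidates, key=lambda n: slots[n])
            remote_reads += 1
        slots[node] -= 1
        assignment[task_id] = node
    return assignment, remote_reads


# Hypothetical cluster: node A has one free slot, B and C have two each.
slots = {"A": 1, "B": 2, "C": 2}
tasks = [("t1", "b1"), ("t2", "b2"), ("t3", "b3")]

# Skewed placement: every hot block lives on node A.
skewed = {"b1": {"A"}, "b2": {"A"}, "b3": {"A"}}
# Balanced placement: the hot blocks are spread over the nodes.
balanced = {"b1": {"A"}, "b2": {"B"}, "b3": {"C"}}

print(assign_tasks(tasks, skewed, slots))
print(assign_tasks(tasks, balanced, slots))
```

In this toy setup the skewed placement forces two of the three tasks to read their input remotely, while the balanced placement needs no remote reads, which is the effect the abstract attributes to ill-balanced hot data.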
Similar resources
Load Balancing and Scheduling of Tasks in Parallel Processing Environment
This proposed model applies to scheduling and doing load balancing of the tasks in the parallel processing applications which can be divided into parts of arbitrary sizes, which in turn can be processed independently on remote computers. The objective is to come up with the optimal divisible load scheduling and balancing model to solve a computational problem in a minimal amount of time (with t...
Load Balancing for Parallel Loops in Workstation Clusters
Load imbalance is a serious impediment to achieving good performance in parallel processing. Global load balancing schemes cannot adequately manage to balance parallel tasks generated from a single application. Dynamic loop scheduling methods are known to be useful in balancing parallel loops on shared-memory multiprocessor machines. However, their centralized nature causes a bottleneck even fo...
Applying Machine Learning Methods for Time Series Forecasting
This paper describes a strategy on learning from time series data and on using learned model for forecasting. Time series forecasting, which analyzes and predicts a variable changing over time, has received much attention due to its use for forecasting stock prices, but it can also be used for pattern recognition and data mining. Our method for learning from time series data consists of detecti...
A Markov Chain Model for Load-Balancing Based and Service Based RAT Selection Algorithms in Heterogeneous Networks
Next Generation Wireless Network (NGWN) is expected to be a heterogeneous network which integrates all different Radio Access Technologies (RATs) through a common platform. A major challenge is how to allocate users to the most suitable RAT for them. An optimized solution can lead to maximize the efficient use of radio resources, achieve better performance for service providers and provide Qual...
Design and Analysis of a Dynamically Reconfigurable Shared Memory Cluster
In recent years, the clusters have become a viable and less expensive alternative to multiprocessor systems. This paper proposes an architecture with a load balancing and a fault tolerant model for shared memory clusters. A task clustering algorithm, a Centralized dynamic load balancing model, a load balancing algorithm and a fault tolerant model are proposed for shared memory clusters. The res...